(57) ABSTRACT

A shared storage distributed file system is presented that
provides applications with transparent access to a storage
area network (SAN) attached storage device. This is
accomplished by providing clients read access to the devices over
the SAN and by requiring most write activity to be serialized
through a network attached storage (NAS) server. Both the
clients and the NAS server are connected to the SAN-attached
device over the SAN. Direct read access to the SAN-attached
device is provided through a local file system on the
client. Write access is provided through a remote file system
on the client that utilizes the NAS server. A supplemental
read path is provided through the NAS server for those
circumstances where the local file system is unable to
provide valid data reads.

Consistency is maintained by comparing modification times
in the local and remote file systems. Since writes occur over
the remote file systems, the consistency mechanism is
capable of flushing data caches in the remote file system, and
invalidating metadata and real-data caches in the local file
system. It is possible to utilize unmodified local and remote
file systems in the present invention, by layering over the
local and remote file systems a new file system. This new file
system need only be installed at each client, allowing the
NAS server file systems to operate unmodified. Alternatively,
the new file system can be combined with the local
file system.
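
The modification-time comparison described above can be sketched as
follows. This is a minimal illustration under stated assumptions, not
the patented implementation: the FakeFileSystem class and the
choose_read_path function are hypothetical names, each file system
layer is reduced to a table of modification times, and flushing of the
remote file system cache is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FakeFileSystem:
    """Toy stand-in for one client-side file system layer (hypothetical)."""
    disk_mtimes: dict                          # authoritative mtime per path
    cached_mtimes: dict = field(default_factory=dict)

    def mtime(self, path):
        # Serve from the cache when possible, as a caching file system would.
        if path not in self.cached_mtimes:
            self.cached_mtimes[path] = self.disk_mtimes[path]
        return self.cached_mtimes[path]

    def invalidate(self, path):
        # Drop cached state so the next lookup goes back to storage.
        self.cached_mtimes.pop(path, None)


def choose_read_path(path, local, remote):
    """Return "local" for a direct SAN read, "remote" for the NAS read path."""
    remote_mtime = remote.mtime(path)
    if local.mtime(path) == remote_mtime:
        return "local"
    # Mismatch: the local cache may be stale, so invalidate it and re-check.
    local.invalidate(path)
    if local.mtime(path) == remote_mtime:
        return "local"
    # Still different: fall back to the supplemental read path via the server.
    return "remote"


# Stale local cache (100) over up-to-date storage (200): local read is valid
# once the cache has been invalidated.
local = FakeFileSystem({"/a": 200}, {"/a": 100})
remote = FakeFileSystem({"/a": 200})
print(choose_read_path("/a", local, remote))
```

The second comparison matters because a mismatch alone does not say
whether the local cache is stale or the storage device itself has not
yet received the newest data; only the latter case needs the read path
through the NAS server.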

STORAGE AREA NETWORK FILE SYSTEM

FIELD OF THE INVENTION

The present invention relates generally to computer file
systems. More specifically, the present invention involves a
distributed file system that transfers data using both network
attached storage (NAS) and storage area network (SAN)
protocols.

BACKGROUND OF THE INVENTION

File Systems

The term "file system" refers to the system designed to
provide computer application programs with access to data
stored on storage devices in a logical, coherent way. File
systems hide the details of how data is stored on storage
devices from application programs. For instance, storage
devices are generally block addressable, in that data is
addressed with the smallest granularity of one block;
multiple, contiguous blocks form an extent. The size of the
particular block, typically 512 bytes in length, depends upon
the actual devices involved. Application programs generally
request data from file systems byte by byte. Consequently,
file systems are responsible for seamlessly mapping between
application program address-space and storage device
address-space.

File systems store volumes of data on storage devices.
The term "volume" refers to the collection of data blocks for
one complete file system instance. These storage devices
may be partitions of single physical devices or logical
collections of several physical devices. Computers may have
access to multiple file system volumes stored on one or more
storage devices.

File systems maintain several different types of files,
including regular files and directory files. Application
programs store and retrieve data from regular files as
contiguous, randomly accessible segments of bytes. With a
byte-addressable address-space, applications may read and write
data at any byte offset within a file. Applications can grow
files by writing data to the end of a file; the size of the file
increases by the amount of data written. Conversely,
applications can truncate files by reducing the file size to any
particular length. Applications are solely responsible for
organizing data stored within regular files, since file systems
are not aware of the content of each regular file.

Files are presented to application programs through
directory files that form a tree-like hierarchy of files and
subdirectories containing more files. Filenames are unique to
directories but not to file system volumes. Application
programs identify files by pathnames comprised of the
filename and the names of all encompassing directories. The
complete directory structure is called the file system
namespace. For each file, file systems maintain attributes
such as ownership information, access privileges, access
times, and modification times.

File systems often utilize the services of operating system
memory caches known as buffer caches and page caches.
These caches generally consist of system memory buffers
stored in volatile, solid-state memory of the computer.
Caching is a technique to speed up data requests from
application programs by saving frequently accessed data in
memory for quick recall by the file system without having to
physically retrieve the data from the storage devices.
Caching is also useful during file writes; the file system may write
data to the memory cache and return control to the
application before the data is actually written to non-volatile
storage. Eventually, the cached data is written to the storage
devices.

The state of the cache depends upon the consistency
between the cache and the storage devices. A cache is
"clean" when its contents are exactly the same as the data
stored on the underlying storage devices. A cache is "dirty"
when its data is newer than the data stored on storage
devices; a cache becomes dirty when the file system has
written to the cache, but the data has not yet been written to
the storage devices. A cache is "stale" when its contents are
older than data stored on the storage devices; a cache
becomes stale when it has not been updated to reflect
changes to the data stored on the storage devices.

In order to maintain consistency between the caches and
the storage devices, file systems perform "flush" and
"invalidate" operations on cached data. A flush operation writes
dirty cached data to the storage devices before returning
control to the caller. An invalidation operation removes stale
data from the cache without invoking calls to the storage
devices. File systems may flush or invalidate caches for
specific byte-ranges of the cached files.

Many file systems utilize data structures called inodes to
store information specific to each file. Copies of these data
structures are maintained in memory and within the storage
devices. Inodes contain attribute information such as file
type, ownership information, access permissions, access
times, modification times, and file size. Inodes also contain
lists of pointers that address data blocks. These pointers may
address single data blocks or address an extent of several
consecutive blocks. The addressed data blocks contain either
actual data stored by the application programs or lists of
pointers to other data blocks. With the information specified
by pointers, the contents of a file can be read or written
by application programs. When application programs
write to files, data blocks may be allocated by the file
system. Such allocation modifies the inodes.

Additionally, file systems maintain information, called
"allocation tables", that indicate which data blocks are
assigned to files and which are available for allocation to
files. File systems modify these allocation tables during file
allocation and de-allocation. Most modern file systems store
allocation tables within the file system volume as bitmap
fields. File systems set bits to signify blocks that are
presently allocated to files and clear bits to signify blocks
available for future allocation.

The terms real-data and metadata classify application
program data and file system structure data, respectively. In
other words, real-data is data that application programs store
in regular files. Conversely, file systems create metadata to
store volume layout information, such as inodes, pointer
blocks, and allocation tables. Metadata is not directly visible
to applications. Metadata requires a fraction of the amount
of storage space that real-data occupies and has significant
locality of reference. As a result, metadata caching
drastically influences file system performance.

Metadata consistency is vital to file system integrity.
Corruption of metadata may result in the complete
destruction of the file system volume. Corruption of real-data may
have bad consequences to users but will not affect the
integrity of the whole volume.

I/O Interfaces

I/O interfaces transport data among computers and
storage devices. Traditionally, interfaces fall into two categories:
channels and networks. Computers generally communicate
with storage devices via channel interfaces. Channels
predictably transfer data with low-latency and high-bandwidth

The I/O interface links 130 connect to the SAN 128,
which consists of network components such as routers,
switches, and hubs. The SAN 128 may also include
components that perform storage virtualization, caching, and
advanced storage management functions. The SAN devices
126 are block and object-addressable, non-volatile storage
devices. The SAN devices 126 may be part of a SAN
appliance 136 or dedicated storage devices attached to the
SAN 128.

The primary read data-path 144 of Nasan is similar to the
read data-path 132 of prior art SAN environments 120,
whereas the secondary read data-path 146 is similar to the
read data-path 114 of prior art NAS environments 100. The
majority of read transfers take place over the primary
data-path 144, which passes from the SAN devices 126,
through the SAN 128, directly to the Nasan clients 142. The
primary data-path 144 takes full advantage of high-speed
SAN protocols. However, some read transfers follow the
secondary data-path 146 and pass from the SAN-attached
devices 126, through the NAS server 106, across the LAN
104, en route to the Nasan clients 142. The state of the
Nasan environment 140 dictates whether the primary
data-path 144 or the secondary data-path 146 is used for read
transfers.

The write data-path 148 of Nasan is similar to the write
data-path 116 of prior art NAS, with the difference being that
the Nasan write data-path 148 also includes the SAN 128. The
write data-path 148 begins at the Nasan clients 142 and
passes through the LAN 104 to the NAS server 106. The
server 106, in turn, writes across the SAN 128 to
SAN-attached devices 126.
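
The three data-paths described above can be summarized in a short
sketch. It is only an illustration: the data_path function and the
local_read_ok flag are hypothetical, with the flag standing in for
whatever state of the Nasan environment 140 decides between the
primary and secondary read data-paths.

```python
# Hops are listed in the order the data travels, using the reference
# numerals from the description.
PRIMARY_READ_PATH = ("SAN devices 126", "SAN 128", "Nasan client 142")
SECONDARY_READ_PATH = (
    "SAN devices 126", "NAS server 106", "LAN 104", "Nasan client 142",
)
WRITE_PATH = (
    "Nasan client 142", "LAN 104", "NAS server 106", "SAN 128",
    "SAN devices 126",
)


def data_path(operation, local_read_ok=True):
    """Pick the data-path for a "read" or "write" transfer.

    local_read_ok is an assumed flag: True when a direct SAN read would
    return valid data, False when the read must go through the server.
    """
    if operation == "write":
        # All writes are serialized through the NAS server.
        return WRITE_PATH
    if operation == "read":
        return PRIMARY_READ_PATH if local_read_ok else SECONDARY_READ_PATH
    raise ValueError(f"unknown operation: {operation!r}")


for op, ok in (("read", True), ("read", False), ("write", True)):
    print(op, ok, " -> ".join(data_path(op, ok)))
```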

Due to high-speed SAN reads 144, the Nasan file system
significantly exceeds the file sharing performance and
scalability of prior art NAS solutions. Although Nasan write
performance is similar to prior art NAS write performance,
Nasan reads are often ten times faster. Because read
operations generally outnumber writes five to one, the
performance improvement made to reads dramatically increases
overall system throughput. Furthermore, by offloading reads
from the NAS servers 106, the Nasan file system
substantially reduces server 106 workloads. With reduced
workloads, servers 106 exhibit shorter response times, sustain
more simultaneous file transfers, and support considerably
larger throughputs than servers 106 supporting traditional
NAS 100.
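
As a rough worked example of the throughput claim, the figures above
(reads ten times faster, five reads per write) can be combined under a
simplifying assumption that is not stated in the description: every
read and every write takes one unit of time on prior art NAS, and
write time is unchanged under Nasan. The function name
overall_speedup is hypothetical.

```python
def overall_speedup(reads_per_write, read_speedup):
    """Ratio of prior art NAS time to Nasan time for one write plus its reads.

    Assumes equal per-operation cost on prior art NAS and unchanged
    write cost, which is a simplification for illustration only.
    """
    baseline_time = reads_per_write + 1.0
    nasan_time = reads_per_write / read_speedup + 1.0
    return baseline_time / nasan_time


# Five reads per write, reads ten times faster:
# baseline = 5 + 1 = 6 units, Nasan = 5/10 + 1 = 1.5 units.
print(overall_speedup(5, 10))  # 4.0
```

Under these assumptions the workload completes about four times
faster, and the unchanged write time becomes the dominant cost, which
is why the gain stays well below the tenfold read improvement.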

The Nasan file system transfers read requests across the
high-speed SAN 128 while serializing writes through a
central NAS server 106. This serialization leads to write
transfer rates that are slower than reads; however,
write serialization facilitates extremely low-latency consistency
management. Low-latency consistency enables Nasan
clients 142 to efficiently transfer files of all sizes. Therefore,
the Nasan file system is a general-purpose solution for
read-intensive workloads.

Nasan Layering

One embodiment of the Nasan file system utilizes a
two-tiered layering scheme. Nasan software occupies the
upper level, while non-modified local and remote file
systems comprise the lower. The Nasan layer provides a
management framework that facilitates data consistency and
routes file requests to the appropriate lower level file system.
All remaining file management functionality is derived from
these lower layer file systems.

Referring to FIG. 5, application programs 150 running on
the Nasan client 142 make file requests to the Nasan file
system software layer 152. Nasan software 152 redirects
read requests to either the local file system level 154 or
the client-side remote file system layer 156 and redirects
write requests to the remote file system layer 156. These
lower layer file systems conduct the actual data
management, transport, and storage.

The local file system 154 of the client provides the
primary read data-path 144 for Nasan transfers. Because the
clients 142 do not directly modify the volume stored on the
SAN devices 126, Nasan software 152 maintains
low-latency consistency by simply invalidating stale caches of
the local file system layer 154.

The remote file system facilitates the secondary read
data-path 146 as well as write access to files managed by the
NAS server 106. The Nasan client 142 passes file requests
to the client-side remote file system layer 156. In turn, the
remote file system 156 on the client 142 transmits these
requests via NAS protocols to the server-side remote file
system layer 158 on the server 106. The NAS server 106
completes the requests by reading data from or writing data
through the local file system 155 of the server 106 to
volumes stored on SAN-attached devices 126.
Write-serialization through the NAS server 106 enables low-latency
consistency.

Components of the Preferred Embodiment

The components and protocols that form the environment
140 of the present invention range in price, performance,
and compatibility. In the preferred embodiment, the
interface links 110, 130 that connect to the LAN 104 and to the
SAN 128 may include Ethernet, InfiniBand, and Fibre
Channel. Over these links 110, 130 run a number of different
network and channel protocols, including Internet Protocol
(IP), SCSI-3, Virtual Interface (VI), iSCSI, FCIP, and iFCP.
The NAS protocols used by the remote file system 156, 158
include Network File System (NFS), Server Message Block
(SMB), and Common Internet File System (CIFS). The
present invention is not limited to these specific components
and protocols.

Local File System Consistency

In general, local file systems perform extensive metadata
and real-data caching. The only consistency management
typically required of local file systems is periodic updates to
on-disk data structures. Cached data is never invalidated
because on-disk data is always assumed to be older than
cached data.

Within a Nasan environment 140, the NAS server 106 has
read-write access to the local file system volume stored on
SAN-attached disks 126, while Nasan clients 142 have
read-only access to this volume. Because the client local file
systems 154 and the server local file systems 155 may not be
designed to support SAN environments with multiple
computers, Nasan software 152 must explicitly maintain data
consistency between storage devices 126 and caches of the
client local file system 154.

Consistency Between Local and Remote File System Layers

Local 154 and remote 156 file systems utilize separate
caches within client 142 main memories. After file writes,
the remote file system 156 cache contains newer data than
the local file system 154 cache. Nasan software 152 makes
the local file system 154 cache consistent with the Nasan
environment 140 by explicitly invalidating stale data within
the cache.

Nasan software 152 has the option to read from the local 
file system 154 or the remote file system 156. When reading 
from the primary data-path 144, Nasan software 152 first 
determines if data is being cached by the client-side remote 
file system 156. If data is cached, Nasan software 152 
flushes the remote file system 156 cache and invalidates the 

locked, the Nasan client 142 reads the allocation tables from
the SAN devices 126, modifies the allocation tables, writes
the tables to the SAN devices 126, and then releases the file
lock.

At step 418, the file is fully allocated for the request range.
The Nasan write function writes the real-data to the
SAN-attached devices 126 via the SAN write data-path 326. Once
this real-data write completes, at step 420, the modified
on-disk inode is written by the client 142 to the
SAN-attached devices 126 and the file lock is released by issuing
an unlock request to the client-side remote file system 156.
The remote file system 156 passes the unlock request to the
server 106 which forwards the unlock request to the
server-side Nasan file system 324. After the file lock is released, the
Nasan write operation completes.

The invention is not to be taken as limited to all of the
details thereof as modifications and variations thereof may
be made without departing from the spirit or scope of the
invention. For instance, the present invention was described
and shown with the SAN and LAN networks appearing as
separate, physical networks. However, as is well known in
the prior art, it is possible to send SAN protocols and LAN
protocols over the same physical network. The two networks
are distinguishable by the protocols that are used to
communicate between nodes on the network. In addition,
although it is not shown in the drawings, it would be possible
to use a client computer in the present invention as a file
server that serves file requests from other computers. These
other computers would likely have no access to the storage
area network, but would have the ability to send file requests
to the client computer of the present invention over a local
area network. Because many such modifications and
variations are present, the scope of the present invention is not to
be limited to the above description, but rather is to be limited
only by the following claims.

What is claimed is:

1. A file system on a computer handling file read and file
write requests to a SAN-attached storage device comprising:
a) a local component that communicates with the
SAN-attached storage device over a storage area network and
that interprets metadata stored on the SAN-attached
storage device;
b) a NAS server that communicates with the
SAN-attached storage device over a storage area network;
c) a remote component that communicates with the NAS
server over a local area network;
d) an upper level component that communicates with
application programs, the upper level component
submitting all file write requests to the remote component
and submitting at least some file read requests to the
local component.

2. The file system of claim 1, wherein each of the
components are separate file systems in which the upper
level component file system is layered above the local
component file system and the remote component file
system.

3. The file system of claim 2 utilizing an installable file
system interface to facilitate layering between the file
systems.

4. The file system of claim 3 is the Virtual File System
interface.

5. The file system of claim 2, wherein the remote
component file system utilizes a protocol chosen from
among the following set: Network File System, Server
Message Block, and Common Internet File System.

6. The file system of claim 1, wherein the upper level
component and the local component are merged into a single
file system, and further wherein the remote component is a
separate file system layered under the single file system
containing the upper level and local components.

7. The file system of claim 1, wherein the upper level
component submits all read requests to the local component.

8. The file system of claim 1, wherein the upper level
component submits read requests above a certain size to the
local component, and read requests below the certain size to
the remote component.

9. The file system of claim 1, wherein the upper level
component submits read requests to the local component if
the file size is above a certain size, and the upper level
component submits read requests to the remote component
if the file size is below the certain size.

10. The file system of claim 1, wherein the upper level
component submits all read requests to the local component
except where the local component is not capable of properly
retrieving data requested in the read request, in which case
the upper level component submits the read request to the
remote component.

11. The file system of claim 10, wherein the upper level
component determines whether the local component is
capable of properly retrieving data requested in the read
request by comparing modification times for a file indicated
in the read request as retrieved from the remote component
and the local component.

12. The file system of claim 11, further wherein the upper
level component determines whether the local component is
capable of properly retrieving data requested in the read
request by comparing modification times for a directory
indicated in the read request as retrieved from the remote
component and the local component.

13. The file system of claim 12, wherein the directory
modification times are compared during a lookup function,
with the results stored in an inode structure for the
higher-level component.

14. The file system of claim 13, wherein the inode
structure for the higher-level component includes the
following:
a) a handle pointing to a file vnode for the remote
component;
b) a handle pointing to a file vnode for the local
component;
c) a remote modification time indicating the modification
time returned by the remote component; and
d) a local modification time indicating the modification
time returned by the local component.

15. The file system of claim 1, wherein the file system is
capable of handling additional file requests, with file
requests that alter data on the SAN-attached storage device
being treated similar to the write requests, and file requests
that do not alter data on the SAN-attached storage device
being treated similar to the read requests.

16. The file system of claim 1, further comprising a file
server component capable of receiving and responding to
file requests from other computers that are connected to the
local area network but not connected to the storage area
network.

17. A network of connected computing devices
comprising:
a) a local area network;
b) a storage area network;
c) a SAN-attached device attached to the storage area
network;
d) a server computer attached to the local area network
and the storage area network; the server computer
receiving file requests across the local area network and



US 7,165,096 B2 

iii) determining whether file requests that do not modify
the SAN-attached device can be fulfilled through the
local file system;
iv) treating file requests that do not modify the
SAN-attached device and can be fulfilled through the local
file system as local requests; and
v) treating the remaining file requests as remote requests.

39. The method of claim 37, wherein the step of
determining whether file requests can be fulfilled through the
local file system further comprises:
(1) determining a file involved in the file requests;
(2) retrieving a modification time for the file from the
local file system;
(3) retrieving a modification time for the file from the
remote file system; and
(4) comparing the local and remote modification times.

40. The method of claim 39, wherein the step of
determining whether file requests can be fulfilled through the
local file system further comprises:
(5) invalidating the local file system cache when the two
modification times are not identical.

41. The method of claim 39, wherein the step of
determining whether file requests can be fulfilled through the
local file system further comprises:
(6) obtaining a new modification time from the local file
system after the local file system cache has been
invalidated;
(7) comparing the new local modification time with the
remote modification time in a second comparison;
(8) treating the file request as a local request if the second
comparison finds the modification times to be identical;
and treating the file request as a remote request if the
second comparison does not find the modification times
to be identical.

42. The method of claim 41, further comprising the step
of performing a lookup function to create an inode for the
file, the inode having a remote handle pointing to a remote
vnode for the file and a local handle pointing to a local vnode
for the file, the remote and local vnodes being used to
identify the file in the remote and local file systems,
respectively.

43. The method of claim 42, wherein the step of
performing a lookup function compares modification times for
the directory containing the file in both the remote and local
file systems to determine whether the local file system
access should be allowed.

44. The method of claim 43, wherein a lookup routine is
performed for the file for both the local and remote file
systems, and further wherein, if the remote file system does
not find the file, a file not found result is returned to the
application.

45. The method of claim 44, wherein if the local file
system does not find the file, the remote file system is used
to service file requests.

46. A network of connected computing devices
comprising:
a) a local area network;
b) a storage area network;
c) a SAN-attached device attached to the storage area
network;
d) a server computer attached to the local area network
and the storage area network; the server computer
receiving file requests across the local area network and
further storing and retrieving data on the SAN-attached
device via the storage area network; and
e) at least one client computer attached to the local area
network and the storage area network; the client
computer having:
i) a remote component in communication with and
making file requests to the server computer over the
local area network;
ii) a local component in communication with and
making metadata and real-data requests to the
SAN-attached device over the storage area network;
iii) an upper level component serving file requests from
an application program operating on the client
computer, the upper level component dividing the file
requests from the application program between the
remote component and the local component.

47. The network of claim 46, wherein the upper level
component submits all file requests having data sizes above
a certain size to the SAN-attached device over the storage
area network via the local component and all file requests
having data sizes below a certain size to the server computer
over the local area network via the remote component.

48. The network of claim 46, wherein the upper level
component submits all file requests to the SAN-attached
device over the storage area network via the local
component if the file size is above a certain size, and the upper level
component submits all file requests to the server computer
over the local area network via the remote component if the
file size is below the certain size.

49. The network of claim 46, wherein file requests
comprise write requests and read requests, and the upper
level component submits all write requests to the server
computer over the local area network via the remote
component, and at least some of the read requests to the
SAN-attached device over the storage area network via the
local component.

50. The network of claim 49, wherein the upper level
component submits all read requests having data sizes above
a certain size to the SAN-attached device via the local
component, and further wherein the upper level component
submits all read requests having data sizes below a certain
size to the server computer via the remote component.

51. The network of claim 49, wherein the upper level
component submits all read requests to the SAN-attached
device via the local component if the file size is above a
certain size, and further wherein the upper level component
submits all read requests to the server computer via the
remote component if the file size is below the certain size.

52. The network of claim 49, wherein the upper level
component submits all read requests through the local
component except where the local file system is not capable
of properly retrieving data requested in the read request, in
which case the upper level component submits the read
request to the remote component.

53. The network of claim 52, wherein the upper level
component determines whether the local component is
capable of properly retrieving data requested in the read
request by comparing modification times received from the
local component with modification times received from the
remote component.

54. The network of claim 49, further comprising a
plurality of additional client computers, wherein, in relation
to the client computers, only the server computer is granted
write access to the SAN-attached device and further wherein
all write requests from the client computers are routed via
the remote component in the client computers to the server
computer.

55. The file system of claim 46, wherein the client
computer further has a file server component that receives